feat(llama-cpp-localai-paged): paged KV cache llama.cpp backend + cross-request prefix sharing + GB10 decode optimization [WIP]#10462
Draft
localai-bot wants to merge 145 commits into
Draft
Conversation
Host-side paged-attention block manager ported faithfully from vLLM V1 (block_pool.py, kv_cache_utils.py, single_type_kv_cache_manager.py): - KVCacheBlock + intrusive LRU FreeBlockQueue (O(1) middle removal) - BlockPool: get_new_blocks / touch / free_blocks eviction ordering / cache_full_blocks / lazy eviction on reuse - PagedKVManager: on-demand allocate, block_table, slot arithmetic (slot = block_id*block_size + offset), free - Prefix caching: chained block hashing + find_longest_cache_hit (first-miss stop), enabling automatic cross-tenant prefix sharing Pure C++17, zero ggml/llama.cpp dependency, unit-tested to vLLM behavioral parity (4/4 suites green). Parity is on algorithm/behavior, not hash bytes. Phase 0 of docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md. Phases 1-5 (ggml storage, gather-to-scratch read path, Gate 0 correctness, benchmark wins, prefix-share serving) follow. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Validate the paged KV read/write path at the ggml-op level, driven by
PagedKVManager:
- write: ggml_set_rows(pool, k_src, slot_mapping) scatter K rows by slot
- read: ggml_get_rows(pool, gather_idx) gather a seq's slots into
contiguous scratch (the tensor an attention kernel consumes)
The test forces a non-contiguous, out-of-order physical block layout
(allocate seqA+seqB, free seqA, reallocate seqC -> blocks [2,1,5]) and
proves gather(write(x)) == x plus cross-sequence isolation in the shared
pool. This de-risks the central question (does slot-addressed paged storage
round-trip correctly through ggml) before the llama-graph integration.
Pool is statically allocated via ggml_backend_alloc_ctx_tensors, mirroring
how llama.cpp allocates its KV cache. CPU backend, no new ggml op.
Built against ggml from the vendored llama.cpp checkout.
Phase 1 of docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Retire the central numeric risk from the design: feeding gather-to-scratch KV (a sequence whose blocks are non-contiguous in the shared pool, [2,1,5]) into ggml's standard attention ops produces correct attention. Path under test: set_rows write -> get_rows gather (K and V) -> mul_mat(K,Q) -> soft_max_ext -> mul_mat(V^T, probs). Result is compared against an independent host-computed softmax attention over the same K/V/Q. Max abs error ~7.5e-08 (n_kv=48, d=8, n_q=4). This proves the paged read path is numerically sound on CPU with no new ggml op. Remaining: wire build_attn_paged into llama-graph.cpp and validate Gate 0 (token-identical greedy generation in a real model). Phase 2 (core) of docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Quantify the two multi-tenant wins that are properties of the host-side
block model (vLLM-parity), independent of the in-model compute path:
WIN 1 concurrency capacity @ 512-block budget
contiguous (reserve n_ctx/seq): 4 sequences
paged (on-demand blocks): 37 sequences
--> 9.2x more concurrent sequences
WIN 3 cross-tenant prefix sharing (32 tenants, 1024-tok shared prefix)
prefix-cache OFF: 2176 physical blocks
prefix-cache ON: 192 physical blocks
--> 11.3x less KV memory
WIN 2 (throughput) is deliberately reported as PENDING: it requires the
paged gather-read path wired into llama-graph.cpp (Gate 0) and is not
measurable at the allocation layer. The win-1 baseline is per-sequence
n_ctx reservation (stream mode); llama.cpp's unified cache already shares
one pool, so the honest win there is on-demand sizing + prefix dedup.
Phase 3 (partial) of docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Capture verified state (P0 manager parity, P1 ggml write/gather, P2 attention numerics 7.5e-08, P3 capacity 9.2x + prefix-sharing 11.3x) and the exact remaining work: wire build_attn_paged into llama-graph.cpp and validate token-identical generation on Qwen3-0.6B (Gate 0), then win-2 throughput. Records the integration seams (create_memory, find_slot, get_k/get_v, build_attn, mask) and the honest caveats (unified cache already shares a pool; vLLM's classic kernel is deprecated) so the next session starts warm. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…KV placement Wire paged, non-contiguous fixed-size BLOCK placement into the real llama.cpp KV cache (find_slot), behind env LLAMA_KV_PAGED, and validate Gate 0 on a real GGUF: Qwen3-0.6B greedy generation is TOKEN-IDENTICAL to the contiguous cache while its KV is physically scattered across permuted blocks (cells 0-15, 144-159, 32-47, ...). Proven non-contiguous via LLAMA_KV_PAGED_DEBUG, not a silent fallback. This retires the correctness premise of paged attention IN THE MODEL (not just at the ggml-op level): attention is invariant to physical KV placement, because reads use per-cell pos/seq metadata for masking. The patch lives at patches/0001-paged-kv-block-placement.patch (against llama.cpp 0253fb21f). Scope: storage/placement layer, single sequence. Remaining (P4): the gather-read compute path (attend only a seq's own blocks) for the throughput win, and the multi-sequence driver. README updated with repro + status. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Captures the full dgx.casa investigation: Q8/F16/vLLM baselines, concurrency sweeps, paged-patch (no concurrency effect), nsys+code root-cause (MoE int8 MMQ on Ampere-class tensor cores = 74.5% compute, no FP8 path), and the lever plan. Measured wins: - Lever 1 (MXFP4 / Blackwell FP4 path): decode +50-66% over Q8, prefill plateau +66% (2200->3650). MXFP4 decode beats vLLM FP8 at B=1 (83 vs 48), near-parity B=8. Prefill still plateaus (fused-MoE-GEMM gap). - Lever 2 (ubatch): saturates at 2048; ceiling is the kernel, not batch. Designed (not built): Lever 3 fused FP4/FP8 MoE grouped GEMM, Lever 4 FP8 GEMM (needs ggml_mul_mat_ext scale plumbing), Lever 5 tcgen05 kernels, and the complete paged attention (on-demand alloc + gather-read + continuous batching + prefix sharing). Honest scope: each is multi-week kernel/systems work. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
On NVIDIA Blackwell consumer GPUs (sm_120/121, incl. GB10/DGX Spark) a larger physical batch (n_ubatch) materially lifts MoE prefill throughput - measured on a GB10 with Qwen3-30B-A3B to lift the prefill ceiling and saturate at ~2048. When a model config leaves `batch:` unset, EffectiveBatchSize now picks 2048 on Blackwell instead of 512; explicit `batch:` always overrides. Detection is a shared, cached Go helper (xsysinfo.IsNVIDIABlackwell, nvidia-smi compute_cap >= 12). Logic is isolated in core/backend/hardware_defaults.go and applied at the common ModelOptions builder, so it covers the C++ llama.cpp backend too. Measured (GB10, Qwen3-Coder-30B-A3B MXFP4): prefill ub512 2994 -> ub2048 3316 t/s; saturates past 2048. Also recorded in the DGX gap plan: 4-bit quant alone captures the decode win (Q4_K_M 93.5 >= MXFP4 86.4 t/s), MXFP4's only edge is prefill via Blackwell FP4 tensor cores. Tests: hardware_defaults_internal_test.go; existing NBatch specs pinned to the no-Blackwell branch for determinism. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Prefill doesn't scale with bigger single prompts (attention O(N^2)); real gap is batched MoE prefill (B=32: 27x vs vLLM, ~22 effective TFLOP/s). nsys pins Lever 3 target: mul_mat_q<MXFP4> MoE GEMM 37% + un-fused act-quant 8%; native FP4 MMA already engaged, inefficiency is the per-expert thin-tile scheduler. Q4_K_M matches MXFP4 on decode (decode win is generic 4-bit); MXFP4's only edge is prefill. Auto-ubatch=2048 on Blackwell shipped (PR #10411). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…m ggml issue draft Plan A (Lever 3): phased path to FP4 MoE GEMM parity — cheap tweaks, act-quant fusion, then the real lever (tcgen05/CUTLASS grouped GEMM), full-model FP4. Plan B (paged attention): on-demand pool, gather-read + Gate 0, continuous batching, prefix sharing; benchmark in memory-pressured/mixed-length regimes. Upstream issue draft: GB10 numbers, nsys profile, ruled-out config knobs, tcgen05 proposal. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
static_assert(nwarps*tile_C::I == mmq_y) locks nwarps=8 for mmq_y=128; can't raise occupancy without co-scaling mmq_y (blows Blackwell smem). MMQ kernel is not freely tunable -> parity needs the tcgen05/CUTLASS rewrite, not knobs. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…s from-scratch No tcgen05/CUTLASS grouped-GEMM MoE kernel exists upstream (merged/in-flight/ draft); CUTLASS not a dep; no fork has one; activation-quant gather already fused. Matching vLLM needs a from-scratch tcgen05 grouped GEMM (months, maintainers deferring to cuTile). No tractable patch closes the 27x. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…ttention Numbered patches under backend/cpp/llama-cpp/patches/ applied in order against the pinned LLAMA_VERSION (build hook in the llama.cpp: target). Each phase is one small, independently-buildable patch so the work rebases cleanly across llama.cpp bumps (anti-drift). README defines the series (0001 vendor manager -> 0006 prefix caching) + the regen workflow. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
First patch of the stacking series. Adds src/paged-kv-manager.{h,cpp} (the
CPU-verified vLLM-parity block manager) + CMake entry. No behavior change.
Generated against the pinned LLAMA_VERSION; applies clean.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…ical find_slot places a sequence's tokens at permuted non-contiguous blocks; greedy generation is token-identical to stock (verified on Qwen3-0.6B at the pin), branch confirmed firing. Default off. The placement substrate for the gather-read. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Every edit mapped (gather-index graph input mirroring k_idxs; gather K/V/mask by one aligned index; n_kv compaction; gated so stock stays byte-identical) with the token-identical gate and the known risks (mask transpose layout, v_trans). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
… single-stream first Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Prefill 6-48x behind and does NOT scale with B (kernel-bound, paging can't fix). Decode: we win at B=1; 2.5-3.7x behind at B>=8 - THAT concurrency gap is the engine's domain (0004 pool + 0005 continuous batching target it). Baseline for the series to improve on. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…is 54.6% MoE GEMM too Decode-dominated B=64 nsys: mul_mat_q<MXFP4> 54.6%, attention only 19.8%. Both phases are FP4-MoE-kernel-bound (Lever 3). The paged series cannot close the vLLM gap in either phase; its real value is capacity + prefix-sharing, not tok/s parity. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…Lever 3)
The only work that closes the vLLM gap on Blackwell: mul_mat_q<MXFP4> is 37%
prefill + 54.6% decode-B64 GPU time; paged attention can't touch it (proven).
Scaffold (builds clean on GB10, default byte-identical): fp4-grouped-moe.{cuh,cu}
entry + gated hook in ggml_cuda_mul_mat_id (env GGML_CUDA_FP4_GROUPED), always
falls back to MMQ for now. Design doc has the CUTLASS/tcgen05 implementation
phases + parity harness + the dense-path follow-up (#28).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…PP 7.6-32x) vLLM W4A16 vs llama Q4_K_M dense: prefill 7.6-32x behind (llama plateaus ~765, vLLM scales to 24.4k); decode ~parity at B=1 (weight-bandwidth-bound), 2.2x at B=64. Full NVFP4 (W4A4) hangs on this vLLM/GB10 stack - W4A16 used. Decision: the Lever-3 kernel track must ALSO deliver a non-grouped FP4 dense GEMM, not just the MoE grouped GEMM (dense GEMM is the simpler first kernel to land). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…n grouped) Benchmark confirms dense prefill 7.6-32x behind too, so the kernel track needs a non-grouped FP4 dense GEMM (simpler, land first) + the MoE grouped GEMM. Both share the e2m1 block-scaled collective; dense is grouped-with-one-group. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…ry flag lever exhausted Confirms parity (dense+MoE, both phases) is strictly the FP4 tensor-core kernel; no config/flag shortcut remains. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…line Researched: W4A4 hangs on GB10 because FlashInfer ships no FP4 cubins for sm_120/121 (all datacenter Sm100a); dense mm_fp4 is gated-off/returns-zeros on consumer Blackwell, and the FlashInfer FP4 autotuner spins on the first forward pass. Not a misconfig - dense W4A4 inference isn't validated on sm_121. W4A16 (4-bit weight / 16-bit act, Marlin) vs llama Q4_K_M is the correct apples-to- apples (same quant class) AND the fast path. Removed the misleading 'W4A4 would be faster / lower bound' framing. Sources: vllm #30163/#26381, flashinfer #2577/#3294, cutlass #3096. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Key corrections: (1) vLLM 24k is AGGREGATE; single-stream roofline ~3300 t/s (BF16) / 6600 (FP4). (2) GB10 is 1:1:2 BF16:INT8:FP4 - INT8 == BF16, only FP4 is 2x. (3) Measured: dense int8-MMQ at 21% of ceiling, MoE FP4-MMQ at ~5% - both EXIST, just untuned for Blackwell. Strategy: to MATCH vLLM, tune MMQ or build a Marlin-style W4A16 BF16 GEMM (FP4 NOT required); to BEAT, fix the existing FP4 MMA on sm_121 (build/miscompile, not greenfield). Dropped the tcgen05 grouped GEMM rewrite. Cheap next test: dense MXFP4 quant + existing FP4-MMA. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
… (~17% of ceiling) MXFP4 dense moves prefill off int8-MMQ onto the FP4-MMA path (existing kernel) for a free 1.44x - shippable as a Blackwell dense-quant recommendation. But it's ~17% of the FP4 roofline, so the FP4-MMA kernel is itself untuned: ~4-6x still in the kernel. Sharpens the target to TUNING the FP4-MMA (serves dense+MoE, only path to beat vLLM). Marlin-style W4A16 BF16 is the alt to match on the BF16 ceiling. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…ever Per-user decode is at parity without spec-dec (10.2 vs 11.7, bandwidth-bound). vLLM's per-user speed = speculative decoding (lossless, target-verified). GB10 is best-case (bandwidth-bound + idle compute); llama.cpp spec-dec measured 2.9x on dense Qwen2.5-32B. Qwen3-32B has no native MTP - use Qwen3-1.7B draft or EAGLE3 head. Recommendation: make spec-dec easy for dense >=14B on Blackwell (keeps Q4_K_M quality, no kernel). Prefill-kernel + continuous-batching are separate (TTFT / aggregate). Our own DGX run pending (box rebooted, llama-cli hangs). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Phase 1 (config, PR #10411, DONE): VRAM-scaled n_parallel + Blackwell batch. Phase 2: paged KV (PR #22569, ~9.5x concurrency). Phase 3: chunked prefill + n_batch/ubatch split. Phase 4: batched-GEMM kernel tuning. Phase 5: backend sampling. Cross-cutting: spec-dec for dense. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
… plan Decisive DGX experiment: rebuilt with -DGGML_CUDA_FORCE_CUBLAS (it's a compile #ifdef, not the runtime env we'd been setting - so prior 'cuBLAS no-op' tests never engaged it). Real result: cuBLAS is SLOWER than MMQ for dense Q4 (pp2048 690 vs 750) and runs an Ampere cutlass_80_tensorop kernel - CUDA-13 has no sm_121 GEMM, falls back to sm_80. So both MMQ and cuBLAS sit at ~46 TFLOP/s; no library shortcut to the 213 ceiling on GB10. Confirms a hand-tuned sm_120a kernel is required. Added the phased W4A16 Marlin-style implementation plan (P0 harness -> P5 enable) as the committed multi-week build; corrected the cuBLAS note. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…-vs-npl plots Public deliverable for the patch-0018..0023 f32 bit-exact paged-attention ship: the apples-to-apples NVFP4 decode benchmark (llama.cpp paged 0023 vs vLLM 0.23.0 on GB10 / DGX Spark, matched weights, CUDA graphs ON both sides). - final_benchmark.csv: clean 8-column plot-ready schema (model,engine,npl,decode_agg_tps,decode_perseq_tps,prefill_tps,ttft_mean_ms,peak_gb), 16 rows (2 models x 2 engines x npl 8/32/64/128). - QWEN36_NVFP4_BENCH.md: embed the two decode-vs-npl plots; add the internal-consistency note (decode_agg vs perseq*npl is TTFT-governed, holds on both engines, no stale-baseline carry-over). - decode-vs-npl PNGs (one per model), llama vs vLLM, per-point llama-%-of-vLLM labels. Headline (measured, nothing pre-assumed): dense llama 90-117% of vLLM decode (ahead at npl8), MoE 77-83%, at higher precision (f32 GDN state + q8 act vs vLLM bf16 GDN + w4a4) and 1.5-3x lower unified memory (on-demand paged KV vs vLLM's flat ~107 GB pool). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…(benchmark finding) The final benchmark exposed TTFT as the weakest number (dense npl128 903s vs vLLM 6-18s, decode-first budget throttling burst-prefill) plus a concrete paged-pool burst-degradation bug (post-burst low-npl prefill collapses 507->65 t/s; decode unaffected). Highest-value serving fix; decode + memory already strong. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Earlier text claimed bf16 = vLLM's own precision; that was a refuted byte-gate draft re-surfacing. The settled finding (BITEXACT_VS_VLLM.md, proven 3 ways) is that vLLM keeps the gated-DeltaNet TEMPORAL state in f32 (only its conv state is bf16). So bf16 temporal is BELOW vLLM's recurrent precision, not a match; and at equal f32 precision llama's recurrence already beats vLLM (84.6% vs 82.4% peak). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Empirical probe on q36-27b-nvfp4 @npl128 (build f7409c2, patch 0023): - attention KV cache default is ALREADY f16 (K/V f16) -> --cache-type f16 is a no-op; q8_0 within noise -> KV dtype is not a decode lever - nsys node-trace decode budget: f32-glue (norms/elementwise/activations/attn, excl. SSM recurrence + NVFP4 GEMM) = 28.7 ms = 8.4% of step (40.9 ms = 12% incl. the non-FP4 cublas GEMM) - f16 realistically recovers ~11-16 ms of the ~27 ms/step gap = ~40-60% of the 8.2% residual -> ~95-96% parity, not a full close; non-bit-exact opt-in only Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Synthesize the GPU kernel-budget probe with the read-only glue source map. Add (4) the implementation cost - llama has no model-compute-dtype knob, the residual stream is F32 by construction (ggml_mul_mat hardcodes F32 output), so f16 glue is not a flag but an opt-in multi-file change (norm.cu f16 kernels + f16 residual stream). Add the final verdict: precision is not the dominant cause of the 8% residual (83% of the step is already f32/W4A4-matched), f16 recovers only 40-60% of the gap and is non-bit-exact, so do not build it as the default; ship the 95%-bit-exact f32 plateau and target the structural cublas/graph-launch ~3-4% instead. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
… paged-pool burst bug as first build target Synthesis of the four read-only/GPU investigations (A MoE grouped-GEMM, B cublas lm_head, C TTFT/paged-pool burst, D dense CUDA-graph): - A: llama already has the sorted-grouped-FP4-MMA GEMM (higher tier than vLLM's GB10 W4A16 Marlin fallback); standalone bit-exact kernel win is bounded on this bandwidth-bound a3b model. Keep down_proj quantize retune (M1) as a cheap bank-shot; fold the decode-graph (M2) into a later shared GDN+MoE decode-graph project. - B: lm_head is BF16 (not FP4), nvjet already ~72% of peak HBM; bit-exact ceiling <1%, the only big win (NVFP4 head) is non-bit-exact and unfair vs vLLM. Dead end. Rank last. - C: paged-pool burst-degradation BUG (Part 2) is a true correctness defect (prefill collapses 507->65 t/s after a burst, restart cures it): reclamation gap on partial seq_rm + free-queue fragmentation. Plus the static decode-first budget (Part 1) explains 903s/213s burst TTFT and the chunked-interleave fix. - D: f32 dense CUDA-graph is STABLE (<1%, no bimodality); the brief's bimodality was the shelved BF16 SSM path. Closed. First build target: the paged-pool burst-degradation bug fix (Fix-1 truncate-on-partial-seq_rm + Fix-2 defrag-on-empty + Fix-3 release-on-slot- completion). Small, localized, default-off byte-identical, crisp repro (npl64 burst then npl8: prefill within 10% of fresh + num_free restored). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…se) (patch 0024) Fixes the paged-pool burst-degradation bug (OTHER_PATHS_INVESTIGATION.md section C Part 2): on a long-lived llama-server with LLAMA_KV_PAGED=1, a high-fan-out prefill burst strands KV blocks in the host-side paged pool, so a later lower-npl prefill draws from a depleted/fragmented pool and its throughput collapses (the benchmark's "restart per npl" crutch). Decode is unaffected. The fix changes only host-side block accounting and placement, never KV values or compute, and is gated behind LLAMA_KV_PAGED (LLAMA_PAGED_NO_RECLAIM=1 restores the pre-fix behavior). Fix-1 reclaim trailing blocks: PagedKVManager::truncate(seq, n_keep) frees every block beyond ceil(n_keep/bs) (ref-counted); called from llama_kv_cache::seq_rm for the p1==MAX && p0>0 partial-tail case so the manager tracks the kv-cache exactly. Fix-2 defrag on empty: when the pool is fully idle, defrag_free_pool() relinks the free queue into ascending block-id order (FreeBlockQueue::rebuild), preserving content-cache hashes. Fix-3 release on slot completion: server_slot::release() issues prompt_clear() under the paged engine so a finished-idle slot returns its blocks promptly. Validation (DGX GB10, q36-27b-nvfp4 = qwen35 hybrid; HEAD f7409c2 = patch 0023): - Bit-exact: greedy md5 identical across paged off / paged on / paged on+NO_RECLAIM (5951a5b4d624ce891e22ab5fca9bc439), == the 0023 baseline. test-backend-ops unaffected (no ggml op touched). - Host unit test: truncate reclaims exactly 16 trailing blocks; defrag restores ascending popleft order. UNIT PASS. - Model A/B (one binary, NO_RECLAIM): fragmentation prefill ratio 0.944 -> 0.998; 64 idle slots strand 2048 blocks, reclaim returns the pool to fresh (2527). - Server A/B (FRESH-npl8 -> BURST-npl64 -> POST-npl8): POST-npl8 prefill collapses 488 -> 44 t/s with NO_RECLAIM (the bug; investigation saw 507 -> 65), restored to 532 t/s (fresh 525, within 1%) with the fix. Paged release-log count 17 -> 96 (Fix-3 fires per slot completion). Canary tokens identical fresh-vs-post in both arms (bit-exact serving). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
New backend = stock llama-cpp grpc-server + the paged patchset (forces LLAMA_PAGED=on), shipped as its own meta-backend (mirrors turboquant, simpler: no fork pin, no grpc-server patching - the paged runtime hooks already exist in grpc-server.cpp). Stock llama-cpp untouched (LLAMA_PAGED?=on retained; the de-risk flip deferred for sign-off). Gallery: qwen3.6-27b-nvfp4 (dense) + qwen3.6-35b-a3b-nvfp4 (MoE) with the benchmark run config (paged_kv, max_batch_tokens, parallel, flash_attention, f16), mudler/ GGUF uris (sha256 TODO until publish). Importer dropdown entry + tests. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…in -> 9d5d882d) Sync to master (12 commits) + the llama.cpp pin bump 8be759e6 -> 9d5d882d. Conflicts resolved: - Makefile .NOTPARALLEL: union (keep both backends/llama-cpp-localai-paged and master's backends/privacy-filter-darwin). - gallery/index.yaml: our 2 base NVFP4 entries (qwen3.6-27b-nvfp4, qwen3.6-35b-a3b-nvfp4) for the paged backend prepended to master's full list; master keeps its own *-nvfp4-mtp variants (distinct entries). Go build + YAML validated; the 8 duplicate gallery names are pre-existing in master, not introduced here. The patchset still needs re-verification against the new tip (pin-sync, next step). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…ches) The worktree merge bumped LLAMA_VERSION 8be759e6 -> 9d5d882d. This re-syncs the paged patch-stack (0001-0024) to the new tip: the stack was rebased onto 9d5d882d on the DGX dev tree, rebuilt clean (CUDA sm_121), and re-validated bit-exact before re-exporting the LocalAI .patch files. Re-exporting each shipped patch from its rebased commit and diffing body-to-body against the committed files identifies exactly 4 that changed and no longer git-apply to 9d5d882d: - 0008 cross-request prefix share: re-anchored the [paged 0008] commit block to the refactored update_slots() lambda (continue->return, batch.n_tokens-> batch.size()); identical env-guarded logic. - 0013 static prefill budget: budget var-block / while-gate / admission-break re-expressed against the refactored loop (add_ok=false idiom). - 0015 expert-density MoE token-tile auto-select: pure context re-anchor; upstream inserted a test_mul_mat_id case at the hunk anchor in test-backend-ops.cpp. The inserted lines are unchanged. (This one rebased cleanly via 3-way but its committed .patch no longer applies with plain git apply, so it is caught by the per-patch apply-check, not by the rebase conflict count.) - 0016 dynamic decode-first budget: dynamic budget block + n_decode_in_batch = batch.size() + add_ok=false against the refactored loop. All four are byte-faithful format-patch exports of the gate-green rebased commits. Applying the full corrected series to a fresh 9d5d882d reproduces the gate-green tree byte-for-byte across every code file. The other 7 touched patches (0009/0017/0018/0019/0020/0021/0024) are LINENUM-only (hunk bodies byte-identical, only @@ line-numbers shifted) and still apply cleanly, so they are left unchanged. The remaining patches are identical. Validation on the rebased build (NVFP4 Qwen3.6, GB10 sm_121): - test-backend-ops CUDA0: GATED_DELTA_NET 36/36, SSM_CONV 45/45, MUL_MAT 1146/1146, MUL_MAT_ID 806/806 all OK. - greedy md5 (-fa on -n 48 --temp 0 --seed 1): dense q36-27b-nvfp4 5951a5b4d624ce891e22ab5fca9bc439 and MoE q36-35b-a3b-nvfp4 07db32c2bcb78d17a43ed18bc22705cd, both == baseline. - decode S_TG @npl128: dense 366.41 t/s (ref 373.2, -1.8%), MoE 751.11 t/s (ref 745.7, +0.7%), both within noise. Details in backend/cpp/llama-cpp/patches/paged/PIN_SYNC_9d5d882d.md. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…findings Mirror of llama.cpp dev-tree patch 0025 (qwen35moe NVFP4 MoE-decode re-graph) and the GPU-agent B findings in SPEEDUP_HUNT.md: re-confirmed MoE decode decomposition @npl128, the measured re-graph lever (+4.4%/+2.9%/+1.9% decode_agg at npl 32/64/128; bit-exact: test-backend-ops MUL_MAT_ID 806/806 + parallel-greedy np16 byte-identical ON==OFF), grouped-GEMM occupancy headroom (exhausted on this bandwidth-bound model), and the W4A16 assessment (rejected: non-bit-exact, slower BF16 MMA). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Append lever C (structural dense residual: lm_head + scheduling) findings and the master RANK + PLAN section to SPEEDUP_HUNT.md. Per-lever scorecard (gain x tractability x gate), ranked build order, the concrete A build plan for the hybrid per-head f32/bf16 SSM state cache, and the ordered B/C/D queue with each one's build trigger. Verdict: ship the MoE re-graph (patch 0025, measured +1.9-4.4%, both gates PASSED) now; build A as the lead (only lever ABOVE vLLM on dense, KL-gated, ~430-454 t/s = 103-108% of vLLM); bank B-2/B-3 on MoE; C last (<1% bit-exact, dead-end); D opt-in-only and dense-only behind the same KL gate bf16-SSM failed. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Lever A patch + build/de-risk results. Splits the persisted gated-DeltaNet recurrent state per head: f32 on long-memory heads (where bf16 rounding does not contract and the KL error concentrates), bf16 on fast-decaying heads, classified at model load by tau_h = 1/(|ssm_a|*softplus(ssm_dt)). Default ssm_hybrid_tau_thresh = 0.0 keeps every head f32 (bit-exact opt-out). De-risk gates BOTH PASS: test-backend-ops GATED_DELTA_NET CUDA0 OK (incl 32 hybrid mixed CUDA-vs-CPU cases); default all-f32 greedy md5 == 0023 baseline both models (dense 5951a5b4d624ce891e22ab5fca9bc439, MoE 07db32c2bcb78d17a43ed18bc22705cd). Known open issue (opt-in hybrid only; default unaffected): hybrid-ON model decode (ids in-place path) is incoherent; classifier/cache/kernel-params verified correct, bug isolated to the ids in-place cross-step state path. See A_HYBRID_SSM_RESULTS.md. Not ready for the GateSweep until fixed. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:opus-4.8 [Claude Code]
…gate sweep (patch 0026) Regenerate patch 0026 with the hybrid-decode carry fix and record the KL/throughput gate-sweep results. Fix: clear(data=true) zeroes the whole recurrent buffer including the head_slot maps, which were uploaded only once at construction; after the post-warmup reset every head read head_slot==0 (f32-local-0), collapsing the split and producing incoherent decode. Persist head_slot_host and re-upload via upload_head_slots() after every buffer clear. Hybrid decode is now coherent and the cross-op state carry is byte-exact (write==read, both partitions). Gate result: de-risk PASS (test-backend-ops 84/84; T=0 md5 == 0023 baseline, both models). Ship gate FAILS - no T_thresh meets MeanKLD<1e-3 AND same-top-p>=99.5% with a meaningful speedup. The premise that the bf16 error concentrates in long-memory heads is refuted: KL scales with the bf16 head count and saturates ~0.06/~91% (MoE saturates at the minimal split). The carry is byte-exact, so this is genuine bf16 sensitivity, not a bug. The byte-saving lever is real (dense +12.4%, MoE +11.5% decode @npl128 at T=128) but cannot meet the strict KL bar. Shipped default-off (f32, bit-exact opt-out); hybrid is opt-in only and not recommended in the gallery config. Full tables in A_HYBRID_SSM_RESULTS.md. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…droom) B-2 / M1 (SPEEDUP_HUNT rank #2): bit-exact block/grid/occupancy retune of quantize_mmq_nvfp4 (the MoE down_proj activation-quant, ~2% of the MoE decode step). Built+measured on a clean 0025 base (DGX GB10 sm_121), then reverted - it does not lift. Finding: the existing blockDim.x=128 is ALREADY the kernel-level optimum for quantize_mmq_nvfp4 on GB10. nsys (8193 invocations): block=128 total 117.4M ns is the fastest; 64 +8.7%, 192 +9.9%, 256 +6.9%. End-to-end MoE decode_agg is flat within 0.4% noise across all block sizes {32..256} (npl32 ~438, npl128 ~751 t/s). The act-quant is ~2% of a BW-bound step, so even a perfect kernel caps the win at ~2%, and 128 is already optimal => measured 0%. Same outcome as patch 0015 (M-tile) and 0017 (MINBLOCKS): no occupancy headroom on this 256-tiny-expert BW-bound model. Bit-exactness proven: md5 identical at block 64/128/256 for both models (the per-thread quant body is untouched; thread->output map is invariant to blockDim.x). Gate at default: dense 5951a5b4 == ref, MoE 07db32c2 == ref, MUL_MAT 1146/1146, MUL_MAT_ID 806/806 PASS. MoE stays ~85% of vLLM @npl128 / ~87% @npl32 - still well below vLLM, so the remaining MoE lever is B-3 (mmq_y-down warp-remap on the grouped FP4 GEMM). No patch 0027; dev tree reverted to pristine 0025. Full data in B_MOE_RESULTS.md. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…ng ~85% of vLLM B-3 (the 0017-deferred mmq_y-down warp-remap of the NVFP4 grouped FP4-MMA mul_mat_q) was built bit-exact on the clean 0025 base and measured: the grouped GEMM kernel itself runs -1.3% (occupancy did rise via the nwarps=4 warp-remap / 128 threads-per-CTA), but end-to-end MoE decode is FLAT (npl128 +0.4%, npl32 +0.3%, within noise) because the stream-k fixup grows +42% (mmq_y=64 doubles the row-tiles) and the step is SSM/BW-bound. md5 PASS both models, test-backend-ops MUL_MAT 1146/1146 + MUL_MAT_ID 806/806 PASS. No patch 0028; DGX dev tree reverted to pristine 0025. Assessment: the bit-exact MoE GEMM/launch track is exhausted (B-1 re-graph banked ~82->85%; B-2 and B-3 are 0). Honest bit-exact MoE ceiling = ~85% of vLLM @npl128. The residual is the structural Marlin-NvFp4 grouped-GEMM gap that no bit-exact lever closes. Recommend shipping the ~85% bit-exact default and exposing the held 0026 bf16-SSM as a default-off opt-in (it reaches ~95% but is non-bit-exact and fails the MoE KL gate). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
… mode Patch 0026 added the hybrid per-head bf16 SSM-state opt-in as the ssm_hybrid_tau_thresh cparam + the --ssm-bf16-tau CLI flag (default 0 = bit-exact f32). Expose it per-model via the LocalAI gallery/model YAML `options:` list, mirroring the paged_kv / max_batch_tokens setenv hooks. - grpc-server.cpp: new `ssm_bf16_tau` (alias `ssm_hybrid_tau`) option -> setenv(LLAMA_SSM_BF16_TAU) when the value parses to a positive float. It does NOT reference the paged-only common_params field, so the turboquant fork (which lacks patch 0026) stays byte-clean. - patch 0026 (common.cpp common_context_params_to_llama): getenv fallback feeds cparams.ssm_hybrid_tau_thresh from LLAMA_SSM_BF16_TAU only when the --ssm-bf16-tau CLI flag is unset (0). Absent/non-positive env => untouched, so stock stays bit-exact; the CLI flag takes precedence when set. - docs: backend/index.yaml note, docs backends.md, gallery header NOTE (referencing A_HYBRID_SSM_RESULTS.md; the 2 NVFP4 entries stay bit-exact). Byte-safe when unset: with no ssm_bf16_tau option the env is never touched and the default f32 bit-exact recurrence is preserved. Verified the parse + consume code paths with a standalone compile-and-run (option string -> LLAMA_SSM_BF16_TAU -> tau, plus 0 / garbage / CLI-precedence / unset cases). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…e Marlin GEMM Ground-truth side-by-side per-kernel ms/step of the MoE decode gap on DGX GB10. llama (752 t/s, step 169.8ms) vs vLLM graphs-on (901-equiv, step 142.0ms): 27.8ms gap. Headline: the grouped MoE-expert GEMM is a llama WIN - native FP4-MMA W4A4 47.3ms vs vLLM Marlin W4A16 50.0ms at the tiny-M decode shape. A Marlin-style W4A16 MoE GEMM would be slower; it is not the lever (extends the w4a16-marlin DENSE verdict). The 15% lives elsewhere: bf16 projections + convert glue (+6.5ms), recurrence state-gather plumbing (+6.6ms, led by k_get_rows 5.2ms), graph coverage + stream overlap (~+7ms), W4A4 act-quant tax (+3.3ms), router/glue (+5.4ms). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…GEMM is a llama win Cross-agent synthesis on top of the both-engine nsys decomposition (3b59571): settle the user's "can we do what vLLM does on MoE?" question with the three converging investigations (groundtruth measurement + vllm-marlin source-read + marlin-port feasibility). Verdict: vLLM's ~15% MoE-decode lead is NOT the Marlin GEMM (that bucket is a -1.7 ms llama WIN: native FP4-MMA W4A4 47.3 vs Marlin W4A16 50.0 at the ragged tiny-M decode shape, both at the LPDDR5x BW floor). The gap is bf16 dense-projection bandwidth (+6.5), recurrence state-gather plumbing (+6.6, led by k_get_rows 5.2), graph/stream-overlap overhead (~+7), W4A4 act-quant tax (+3.3), and router/glue (+5.4). A W4A16/Marlin grouped MoE GEMM is REJECTED (default and opt-in): it would regress the 27% GEMM bucket to half-rate bf16 MMA, re-enter the GB10 occupancy wall the dense scaffold already STOPPED at, and its entire intrinsic upside is the ~2% act-quant tax - smaller than the bit-exact +1.9% the 0025 re-graph already banked, and closeable bit-exactly by fusing the act-quant. Recommended build (none a new MoE GEMM): (1) fuse the k_get_rows SSM-state gather (bit-exact, ~+5, biggest single-kernel win); (2) extend CUDA-graph coverage + stream overlap (bit-exact, ~+7); (3) fuse the W4A4 act-quant into RMSNorm/SiLU (bit-exact, +3.3); (4) NVFP4-quantize the still-bf16 GDN/attn projections + lm_head (bit-changing, +6.5, the same NVFP4-dense-quant move vLLM makes). Bit-exact levers alone reach ~94% of vLLM; with the projection quant ~96-97%, parity-or-better physically in reach since both heaviest kernels (SSM core, MoE GEMM) are already llama wins. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Fuse the residual k_get_rows_float in the gated-DeltaNet decode path (the biggest single kernel vLLM lacks per MOE_GAP_VS_VLLM.md, ~5.2 ms/step MoE). 0019 fused the SSM-state gather, 0021 fused the conv compute but kept a build_rs gather for the conv taps; nsys located that conv-state tap gather (n_embd_r=24576 floats x 128 seqs, ~720 x ~115 us per 24-step window) as the last k_get_rows in the GDN path. New op ggml_ssm_conv_update_inplace_ids reads each sequence's prior conv taps from cache[ids[s]] in-kernel (identity in place from the write slot, non-identity via a disjoint scratch), mirroring the 0019 in-place + ids fusion. Bit-exact: read VALUES unchanged, only the read path changes. Helps both dense and MoE (shared GDN conv). GATE test-backend-ops (CUDA0 2/2): SSM_CONV_UPDATE_IDS, SSM_CONV_UPDATE, SSM_CONV, GATED_DELTA_NET, GET_ROWS all PASS. GATE greedy md5 (-temp 0 -seed 1 -n 48) BYTE-IDENTICAL both models: q36-27b-nvfp4 5951a5b4..., q36-35b-a3b-nvfp4 07db32c2... nsys: k_get_rows<float,float> 10174 -> 9454 instances, 186.3 -> 102.8 ms (720 conv gathers eliminated, replaced by a ~1.1 us no-op gather). Built and gated on the DGX llama tree (branch paged, commit 944636c, f32 default). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Add qwen3.6-27b-nvfp4-mtp-paged and qwen3.6-35b-a3b-nvfp4-mtp-paged: the existing michaelw9999 NVFP4-MTP GGUFs (same uri/sha256/filename and the recommended Qwen3.6 sampling defaults) wired to backend llama-cpp-localai-paged with our optimized paged options (f16, flash attention, 128k context, gpu_layers 99, batch 512, paged_kv, decode-first max_batch_tokens, kv_unified:false, parallel:128). These coexist with the stock llama-cpp *-nvfp4-mtp entries (distinct -paged names) so the four LocalAI-paged NVFP4 entries sit together at the top of the gallery. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Rename the two base NVFP4 entries to a consistent -paged suffix (qwen3.6-27b-nvfp4 -> qwen3.6-27b-nvfp4-paged, qwen3.6-35b-a3b-nvfp4 -> qwen3.6-35b-a3b-nvfp4-paged) so all four base/MTP paged entries share the naming convention. Update the two matching examples in the backend plan doc. Add qwopus3.6-27b-v2-mtp-nvfp4-paged and qwopus3.6-27b-coder-mtp-nvfp4-paged: verbatim copies of the stock qwopus NVFP4-MTP entries (same GGUF uri/sha256, sampling, template, tags, function block) rewired onto the LocalAI paged-attention stack (backend llama-cpp-localai-paged; f16, flash_attention, 131072 context, 99 gpu_layers, batch 512; paged_kv + max_batch_tokens:512 + kv_unified:false + parallel:128). The stock entries are left untouched. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The dense + MoE base NVFP4 GGUFs are live (huggingface.co/mudler/Qwen3.6-27B-NVFP4-GGUF and .../Qwen3.6-35B-A3B-NVFP4-GGUF), sha256 verified vs the Hub LFS hash, uris resolve. Replaces the placeholder/not-yet-published TODO. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…-attention # Conflicts: # gallery/index.yaml
…tion (patch 0028) Anchors the rigorous same-session A/B validation of patch 0028 (residual conv-state tap k_get_rows fusion) on this worktree branch with sign-off attribution. The regenerated 0028 patch + bench-updated LEVER1_GATHER_RESULTS.md first landed via a concurrent origin/master merge (c1f1d1e) that swept the staged files; this records the provenance and the bench summary in the checkpoint. Gate (bit-exact, greedy --temp 0 --seed 1 -n 48): dense q36-27b-nvfp4 5951a5b4d624ce891e22ab5fca9bc439, MoE q36-35b-a3b-nvfp4 07db32c2bcb78d17a43ed18bc22705cd (both == baseline; base == lever1). decode_agg npl128: dense 369.95 -> 377.83 t/s (+2.13%, 96.6% of vLLM), MoE 763.47 -> 777.95 t/s (+1.90%, 86.3% of vLLM). nsys MoE decode: k_get_rows_float 17334 -> 15414 inst (-1920), 358.37 -> 133.52 ms, step -3.13 ms. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…tions (+6.5ms bucket) Scopes lever 4 (read-only, no GPU) on top of the flat levers 2+3. Root cause: the MoE GGUF (nvidia modelopt, 241 NVFP4 tensors) quantized only the experts and left the GDN/attn linear projections in BF16, while the dense GGUF (unsloth, 304 NVFP4 tensors) already has them NVFP4 (proven: dense ssm_out runs FP4 MMQ; dense decode at 96.6% of vLLM). Lever 4 = re-quantize the MoE GGUF's bf16 GDN/attn projections to NVFP4, the same move vLLM makes on the identical weights - the +6.5ms projections bucket, the largest single banked MoE gain available. Path: offline re-quantize to a new GGUF variant (expanded --tensor-type); zero kernel code - the loader sidecar-scale path + tuned mul_mat_q<NVFP4> are already in tree and proven by the dense GGUF. Bit-changing => KL-gate, not md5. KL expected to pass (per-step non-accumulating weight quant, unlike the failed bf16-state; experts already W4A4-clean); lm_head is the one risky tensor (gate on argmax-agreement). Expected ~+4-6.5ms => MoE 86.3% -> ~88-91% of vLLM. Recommend a separate OPT-IN gallery variant (preserve the bit-exact default; promote to default only if the KL gate is clean). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…L, no-ship Re-quantizing the MoE GGUF's bf16 GDN/attn projections to NVFP4 (the lever-4 scope hypothesis) fails the KL gate on every axis vs the shipping NVFP4 baseline: PPL +6.51% (FULL) / +6.15% (CONS) against a <1% gate, mean KLD-to-f16 0.164/0.172 vs baseline 0.137, top-1 argmax agreement down ~2.2-2.6 points. Both projq variants rejected; in_proj_ba being kept bf16 (CONS) recovered almost nothing, so the damage is in the bulk attn/GDN projections. Root cause: the bf16 projections are a deliberate modelopt precision choice, not a provenance accident. vLLM runs the same modelopt checkpoint, so it keeps these projections bf16 too - the baseline GGUF already matches vLLM. The ~20.3ms projection-GEMM bucket is the price of high-precision projections that vLLM also pays; it is not the llama-vs-vLLM lever it appeared to be. The speed win is only purchasable with a 6% PPL regression. MoE stays at 86.3% of vLLM @ npl128. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Conclude the MoE-parity hunt. The two remaining sub-levers in the 20.3-vs-13.8 ms projection bucket are both bit-changing or at the BW floor: - convert-glue (3.24 ms/step, measured: 1.73 input f32->bf16 + 1.52 output bf16->f32): NOT bit-exact eliminable. ggml-cuda.cu:1663-1690 rounds the f32 GEMM accumulator to bf16 (CUDA_R_16BF dst) then widens to f32; that bf16-rounded value is load-bearing for the shipped md5. Removing the round-trip (f32-direct output, bf16 residual stream, or NVFP4 weights) all rebaseline md5. A precision boundary, like lever 4. - bf16 projection GEMM (17.27 ms/step): BW-bound at the LPDDR5x floor (~4.7 GB/step at 273 GB/s; M=128 -> 128 FLOP/byte vs >900 ridge). nvjet already TMA-streams the weights; cutlass reads the same bytes. No kernel lever; only fewer bytes (quantize) helps - rejected on quality. Corrects the body premise that vLLM runs these projections as NVFP4-Marlin: vLLM runs the same nvidia-modelopt checkpoint that keeps them BF16, so the projection bucket is a matched-precision gap, not a quant gap. Realistic bit-exact MoE ceiling ~86-88% of vLLM; shipped lever 1 (86.3%) is at it. No one-more-lever for MoE. Only clean win left is DENSE (+0.41% lever 5), gated behind resolving the paged-MoE baseline md5 drift. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Mirror of paged-dev commit e2acb3b (lever 5). get_block_table() is recomputed once per full-attention layer per decode step, but the KV cell layout is fixed for the whole step (it only changes in apply()). This caches the table the first time it is built in a step and memcpy-reuses the identical bytes for the rest, invalidating in apply(). Bit-exact; toggle off with LLAMA_PAGED_NO_BT_CACHE=1. Host-side get_block_table time (llama-batched-bench, npp128 ntg128 npl128, cache OFF -> ON): MoE 112.94 -> 14.82 ms (-87%), dense 193.78 -> 16.90 ms (-91%). Dense decode is partly host-bound and gains (TG 364.8 -> 374.7 t/s, ~96% of the vLLM 391 t/s @npl128 reference); MoE decode is compute-bound (FP4 GEMM) so the saved host time is off the critical path and MoE TG is flat. Details in LEVER5_HOSTPIPE_RESULTS.md. Also records the per-path bit-exactness gate (PAGED_BITEXACT_NOTE.md): the paged-MoE greedy md5 (8cb0ce23) differs from the non-paged md5 (07db32c2) by a benign FP-accumulation-order difference of the paged attention reduction, not a bug. KL-validated vs the f16 reference (16 chunks, c512): KLD(paged||f16) = 0.13600 <= KLD(nonpaged||f16) = 0.13660, PPL(paged) = 7.4009 ~ PPL(nonpaged) = 7.3896 (within +/- 0.29). Canonical references are now per path: non-paged MoE 07db32c2 and paged MoE 8cb0ce23; dense is bit-exact across paths (5951a5b4). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Status: draft / WIP - opened to track ongoing GB10 enterprise-serving work. Large branch (kernel experiments + analysis + the shippable feature); will be curated before any merge.
What this is
Vendored, opt-in paged KV cache + cross-request prefix sharing for the llama.cpp backend, plus GB10 (consumer Blackwell, sm_121) decode-path optimization and the supporting analysis. All paged behaviour is gated by
LLAMA_KV_PAGED(env) / thekv_pagedserver option and is off by default - stock builds are byte-identical.Shippable feature pieces
backend/cpp/llama-cpp/patches/paged/0001-0011- vendored llama.cpp patch series, applied behind theLLAMA_PAGEDbuild flag (patches/paged/, default on;LLAMA_PAGED=offgives a clean upstream checkout). Isolated inprepare.sh+Makefilewith a sentinel guard against double-apply.grpc-server.cpp-kv_pagedper-server option (0005) + cross-request prefix share wired intoupdate_slots(0008).core/backend/hardware_defaults.go,pkg/xsysinfo/gpu.go- hardware-aware default consolidation.Key results (measured on DGX Spark / GB10, Qwen3-32B NVFP4)
Analysis docs live under
backend/cpp/llama-cpp/patches/paged/*.mdandbackend/cpp/llama-cpp/paged/*.md.Next
Not for merge as-is
This branch also contains banked W4A16/Marlin kernel experiments and NVFP4/MXFP4 quality analysis that informed the direction but are not part of the feature. Those will be dropped/split before merge.